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Abstract 

Haplotype resolved genome sequence information is of growing interest due to its applications in 
both population genetics and medical genetics. Here, we assess the ability to correctly reconstruct 
haplotype sequences using fosmid pooled sequencing and apply the sequences to explore 
historical population relationships. We resolved phased haplotypes of sample NA 19240, a trio 
child from the Yoruba HapMap collection using pools of a total of 521,783 fosmid clones. We 
phased 93% of heterozygous SNPs into haplotype-resolved blocks, with anN50 size of 318kb. 
Using trio information from HapMap, we linked adjacent blocks together to form paternal and 
maternal alleles, producing near-to-complete haplotypes. Comparison with 33 individual fosmids 
sequenced using capillary sequencing shows that our reconstructed sequence haplotypes have a 
sequence error rate of 0.005%. Utilizing fosmid-phased haplotypes from a Yoruba, a European 
and a Gujarati sample, we analyzed population history and inferred population split times. We 
date the initial split between Yoruba and out of African populations to 90,000-100,000 years ago 
with substantial gene flow occurring until nearly 50,000 years ago, and obtain congruent results 
with the autosomes and the X chromosome. We estimate that the initial split between European 
and Gujarati population occurred around 45,000 years ago and gene flow ended around 28,000 
years ago. Analysis of X vs autosome inferred effective population sizes reveals distinct epochs 
in which the ratio of the effective number of males to females changes. We find a period of 
female bias during the ancestral human lineage up to 1 million years ago and a short period of 
male bias in Yoruba lineage from 160-400 thousand years ago. We demonstrate the construction 
of haplotype sequences of sufficient completeness and accuracy for population genetic analysis. 
As experimental and analytic methods improve, these approaches will continue to shed new light 
on the history of populations. 
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Introduction 

DNA resequencing technologies have made it possible to identify genetic variation across 
thousands of individuals. However, most resequencing studies of this kind are "phase-insensitive", 
providing a mixed readout of diploid genomes that neglects the haplotype configuration of two 
homologous chromosomes. Haplotype information is key to understanding the relationships 
between genetic variation and phenotype, such as how cz's-acting eQTL affecting gene expression 
and whether combinations of heterozygous variants lead to additional combined phenotypic 
effects (reviewed in [1]). Haplotypes are also informative for inferring genetic ancestry [2-5] and 
reconstructing population history [6-8]. Recent studies have also identified haplotypes of 
Neanderthal ancestry and examined how archaic introgression shaped our genomes [9-12]. 

Haplotypes are often inferred by statistical methods that utilize population genotype data 
to model the haplotype pairs of an individual as an imperfect mosaic of other haplotypes [13]. 
However, this approach is applicable to common SNPs rather than rare or individual-specific 
variants [14]. Trio-based phasing is more accurate but sometimes unavailable and is uncertain for 
the phase of variants when all individuals are heterozygous. Several experimental phasing 
methods are now available. Although full genome sequencing of individual haploid cells, such as 
sperm, is now possible [15,16], most approaches obtain sequence information from individual 
haplotypes by physically separating DNA fragments [17]. These approaches include use of clone 
end-sequence pairs [18-20], analysis of clone pools [21-23], or library creation from dilute pools 
of single molecules [24,25]. Approaches based on other concepts, such as the spatial structure of 
chromosomes in the nucleus [26], are also being developed. 

Here, we construct long-range haplotypes by sequencing fosmid pools where each fosmid 
represents ~35kb of haplotype-specific sequence [22]. Heterozygous variant positions within 
overlapping clone fragments were then used to assemble contiguous haplotypes. Fosmid-based 
haplotyping can usually achieve N50 block greater than 300kb, depending on the density of 
heterozygous SNPs and the number of clones sequenced [27]. Additional information from SNPs 
phased by trio transmission can further link blocks together, producing near-to-complete 
haplotypes. The haplotype of a Gujarati Indian Individual (NA20847; GIH) was first resolved by 
using this method [22], followed by NA12878, a HapMap trio child from the CEU population 
[23]. Here, we describe the physically phased genomes of NA19240, from the YRI population. 
We assess the ability to correctly assemble SNP haplotypes and to accurately integrate other types 
of sequence variation into a coherent picture of diploid genomic structure. 

Utilizing phased information from individuals from three different populations, we infer 
population history and population split times based on patterns of haplotype coalescence using 
the Pairwise Sequentially Markovian Coalescent (PSMC) model [28]. The PSMC approach is 
based on the distribution of coalescence times, or time to the most recent common ancestor 
(TMRCA) for hundreds of thousands of loci along a chromosome that have independent histories 
due to historic recombination events. Typically, these TMRCA values are calculated for the two 
haploid genomes which compose the diploid genome of a single sample. Utilizing haploid 
genomes from distinct individuals, the method infers TMRCA distributions across populations, 
which is informative about the timing of population splits. This information is complementary to 
that used in other methods, such as inference based on the site frequency spectrum [29,30], and 
permits direct comparisons of inferred estimates on the autosomes and X chromosome. 
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Comparative analyses of genetic diversity between the autosomes and the X chromosome 
provides estimates of the ratio of effective population size (N E ) of X chromosome to autosome, 
with the expectation that this ratio should be 3 /4 given equal male and female effective population 
size and constant population size throughout history. However, several studies have observed a 
larger X-vs-autosome diversity ratio in African populations, indicating female bias (larger female 
effective population size) [31-33]. Deviation from the expected ratio of % may result from sex- 
biased demographic events [34] or skewed breeding ratio [35] leading to different Ne of males 
and females, as well as other processes such as natural selection acting differentially on X and 
autosomes [36]. When applying PSMC on autosomes and X chromosome, we can obtain direct 
estimates of X and autosome effective population size at each time point and infer time period for 
potentially sex-biases. 

Results 

Fosmid Pool Sequencing 

We constructed two independent fosmid libraries from sample NA 19240, split into a total 
of 288 pools each containing -2,000 independent clones (Supplemental Table 1). Barcoded 
sequencing libraries were constructed form each pool using the Nextera sample library generation 
protocol [37] and sequenced to an average depth of ~3 across six lanes of an Illumina HiSeq. 
(Supplemental Table 1). We mapped the resulting reads to a version of the human genome 
reference sequence that included the GrCh37 reference as well as sequence from the E. coli 
genome, the fosmid vector backbone, and Epstein Barr Virus (EBV). To define fosmid clones, we 
identified islands of increased coverage in contiguous 1 kb windows filtered by length (lOkb to 
50kb) and read depth(above 0.25). This procedure identified 521,783 individual clones (median 
length 34.0kb), with median lkb coverage of 1.95. In total across all pools, 96.4% of the genome 
was covered by at least one clone, with an average physical coverage of 5 clones and a median 
sequence coverage of 17. 9x (Supplemental Figure 1) 

Haplotype Resolution 

We identified 2,405,813 heterozygous SNPs in NA19240 based on 15x whole genome 
sequencing from a standard and Nextera-based whole genome library. Heterozygous SNPs 
showed a concordance of 99.2% with 1000 Genomes [38] and 99.5% with HapMap [39]. Non- 
reference sensitivity for all SNPs was 88.4% compared with HapMap variants, and 87.1% 
compared with 1000 Genomes, while heterozygous SNPs showed a concordance of 99.98% with 
1000 Genomes and 99.5% with HapMap (Supplemental Table 2). The genotype inferred for each 
of these heterozygous SNP positions in each of the inferred clones served as the input for 
reconstructing phased haplotypes. We utilized the ReFHap algorithm [27], previously 
demonstrated to have superior performance on this type of data, and obtained 17,388 haplotype- 
resolved blocks, with N90 block size of 71.2kbp, an N50 of 318.5kbp and an N10 of 1.04 mbp, 
and phased 92.84% of the 2,405,813 heterozygous SNP genotyped by whole-genome sequencing. 
For comparison, we reanalyzed published data from sample NA20847 by mapping whole genome 
and fosmid pool sequence data to the GRCh37 assembly and inferring phase using this pipeline 
(Figure 1). Results are similar to those previously obtained, with NA 19240 having longer phased 
blocks due to increased density of heterozygous sites in African relative to GIH samples [40]. 
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Although SNPs within each block are phased, the relationships between blocks cannot be 
directly established due to the absence of linking fosmid clones. We used using phase information 
determined by transmission at informative sites in the HapMap trio data for NA 19240 to 
overcome this limitation [41]. Using this data, we assigned our ReFHap haplotypes within each 
block to the paternal and maternal derived chromosomes. This linked adjacent blocks together 
producing near-to-complete haplotypes. In total, 92.8% of blocks were successfully assigned to a 
parental allele. Comparison with the SNPs deterministically phaed using HapMap trio data 
identified 506 switch errors within our inferred haplotypes, which we corrected prior to 
subsequent analysis (Table 1). A switch error is an inconsistency between an assembled 
haplotype and the real haplotype between two contiguous variants. Examination of the errors 
indicates that most of the switch errors are due to insufficient clone support when linking variants 
together as ReFHap will assemble variants even when there is a single clone overlapping two 
variants (Supplemental Figure 2). For GIH sample NA20847, we linked together phased 
haplotypes within ReFHap blocks based on the statistically inferred haplotypes from the HapMap 
project [40], and randomly assigned haplotypes for those blocks without an overlap to an 
informative SNP. This yielded 2,751 apparent switch-errors, which we infer to be mostly errors in 
the haplotypes inferred statistically [22]. To further assess our overall haplotype accuracy, we 
compared our phased SNPs for NA 19240 with 1000 Genomes heterozygous SNPs phased based 
on trio transmission [38], and found 99.7% concordance (Table 1). 

Based on the reconstructed SNP haplotypes, we assigned individual clones to the paternal 
or maternal alleles. This procedure assigned 420,637 clones, while 5,376 clones could not be 
assigned because some RefHap blocks contained no HapMap or 1 000 Genomes informative 
SNPs. We created haplotype-specific BAMs by extracting the reads corresponding to each fosmid 
clone. From these BAMs, we recreated haplotype-specific sequences, resulting in the 
identification of 186,507 additional heterozygous SNPs missing from our original heterozygous 
SNP call set. Among the additional SNPs, 76,726 (41.1%) were initially identified by our 15x 
whole genome sequence data, but were filtered by the Variant Quality Score Recalibration 
procedure implemented in GATK, and 82,136 SNPs (44% of the additional SNPs) overlapped 
with the 1000 Genomes call set. 

To assess the sequence accuracy of inferred haplotypes we compared the haplotypes 
inferred from the haplotype specific BAMs with the sequence of 33 fosmid clones from the same 
individual that were previously sequenced using standard capillary sequencing [42]. Based on 
heterozygous SNPs, we assigned each finished clone sequence to the maternal or paternal 
haplotype, and compared the clone sequence with that inferred from our data (Supplemental 
Table 3). Our inferred sequence differs at 13 of the 1,107 heterozygous sites (1.1%) encompassed 
by the 33 clones. In total, the aligned clones encompass 1,144,737 bp excluding alignment gaps, 
and have 56 single nucleotide differences in comparison with our data. If we assume that all of 
these differences are errors in our inferred sequences, this suggests that our haplotypes have an 
overall sequence error rate of less than 0.005% or a Phred [43] quality greater than Q40. 

Comparison with other data sets 

Comparisons among sequence call sets continue to have substantial disagreement, an 
observation that likely result from differences in filtering, the amount of the genome accessible to 
different technologies, and algorithmic differences in read mapping and variant identification 
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[44,45]. Since sample NA19240 has been studied by a range of technologies, we compared our 
fosmid-derived call set with previously published data on the same sample [38,39]. We 
specifically focus on heterozygous SNPs, as they are more error-prone and important for our 
further analysis. We compared heterozygous SNPs from the 1000 genomes project, Complete 
Genomics, our call set based on WGS and fosmid pool sequencing, and high-density SNP array 
data from the Affymetrix Axiom array (Figure2A). Although most variants were identified by 
multiple platforms, there are a substantial number of calls unique to a single approach. For 
example, there are 236,613 heterzygous SNPs identified by Complete Genomics that are not 
identified by the others; when variants reported as low quality are excluded, this number is still 
198,988. Our fosmid haplotype calls also include 128,329 heterozygous SNPs not reported by 
other studies. Of these, 85,119 are also identified by our 15X whole genome sequencing. 

We also performed indel discovery using GATK. We intersected calls from whole 
genome sequencing, fosmid pool analysis, as well as the 1000 Genomes. We required sites to 
have the same position and also the same reference and alternative alleles, and identified 147,222 
indels shared by all three datasets (Figure2B). We additionally determined indel phase from our 
fosmid data and intercepted it with indels predicted by 1000 Genomes Project and phased based 
on trio transmission. In total, we successfully phased 383,480 indels, of which 54.9% are 
heterozygous, and 50.3% were also confirmed by 1000 Genomes on the same individual 
(Supplemental Figure 4, Supplemental Table 4). This comparatively lower overlap among indel 
call sets highlights the continued challenges of current approaches for accurate indel calling. 

Population split time estimation 

We utilized phase-resolved genome sequence data from three individuals to analyze the 
split- times of three human populations. This includes two individuals, NA 19240 from the YRI 
population and NA20847 from the GIH population, constructed using the above approach as well 
as phased haplotypes for NA12878 from the CEU population, previously published using a 
similar approach based on AB SoliD Sequencing of fosmid pools [27]. The PSMC method can 
infer changes in population size over time by the distribution of TMRCA for the two 
chromosomes within an individual [28]. We first compared PSMC curves based on our 
constructed haplotypes for NA 19240 with PSMC curves obtained from 1000 Genomes 
sequencing data and Complete Genomics sequencing data. Our constructed haplotypes recovered 
nearly the same history as that obtained from 1 000 Genomes sequencing data while Complete 
Genomics gave a slightly increased curve due to higher heterozygous variants called in their 
sequences (Supplemental Figure 5). Interpretation of coalescence times requires a calibration of 
mutation rates and generation times. For all subsequent analysis, we assumed a human mutation 
rate of 1.2x10-8 bp per generation and 25 years as generation time, although results can be easily 
rescaled for comparison with other estimates [46]. The PSMC curves of three individuals 
revealed that all populations shared same N e history prior to 300 kyr ago, as previously reported 
[28,47]. The inferred Ne of YRI population began to differentiate from non-African populations 
around 240 kyr ago and experienced a milder bottleneck and recovered earlier than non-African 
populations (Figure 3A), although we note that the simulations indicate that such shits in PSMC 
curves may overestimate the timing of population size changes [1 1,28]. CEU and GIH 
populations shared a similar history before 30 kyr ago. Such observations were equivalent to 
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previous PSMC analysis on diploid genomes after adjusting for differences in assumed mutation 
rate [28]. 

If population splits are total and sudden, no coalescent events will happen after the split 
time. When applying PSMC on a pseudo-diploid individual where one chromosome comes from 
one population and the second chromosome comes from another population, the time when the 
PSMC estimate of N e goes to infinity provides an estimate for the population split time [28]. To 
test the performance of PSCM on split-time estimation we first applied the method on sequences 
simulated using the model of human population history inferred by Gravel et al. [30]. As 
expected, for simulations with no migration after the split times, the PSMC curve goes to infinity 
directly after the split event (Supplemental Figure 6C-D). However, when migration continues 
after the after the split, the PSMC curves goes to infinity in a step-wise manner, with the timing 
of the initial increase approximating the initial split event. To further explore if PSMC can infer 
the time when gene flow ceases, we simulated sequences following Gravel's model but with gene 
flow ending at different times. Instead of relying on the time when inferred Ne goes to infinity, 
we examine the tail of the estimated TMRCA distribution (Supplemental Figure 7A). Even for a 
history with no migration after the split, the TMRCA distribution includes some coalescence 
events after the split-time, reflecting stochastic effects in the inference procedure. However, we 
observe that the point after which less than 0. 1% of coalescence events occur approximates the 
end of gene flow (Supplemental Figure 7A). 

Next, we applied this method to real data using single haploid genomes from two 
different populations. The resulting PSMC curves do not go to infinity immediately, but increase 
by steps, suggesting substantial gene flow following the initial split. Using the time when PSMC 
curves from pseudo-diploids first increase as an estimate for initial split event, we estimate 
African and non- African population initial split times of 92.7-97.0 thousand years ago, while 
European and GIH populations initially split more recently, around 44.5-46.5 thousand years ago 
(Figure 3B-D, Supplemental Figure 5B-D). We then looked at the tail of the inferred TMRCA 
distributions (Supplemental Figure 7B) and used the time when the tail percentage is less than 
0. 1 % as an estimate of when gene flow ended. We estimated that African and non- African 
population had substantial gene flow until 50.9-53.4 thousand years ago, and detected gene flow 
between CEU and GIH populations ending 27.2-28.4 thousand years ago. We also applied these 
approaches to pseudo-diploid X chromosomes, scaling the effective population size by % and 
using a male-to-female mutation rate ratio a of 2 to account for mutation rate differences between 
males and females. We found that in general the split-time estimates inferred for the X 
chromosome are similar to those observed for the autosomes. We obtained 95% confidence 
intervals by performing PSMC on bootstrapped sequences sampled 100 times and summarized 
the result in Table 2. Our results suggest a long period of genetic exchange after the initial split 
between the Yoruba and non- African ancestors that began 100,000 years ago, and lasted for over 
50,000 years, with less than 0.1% of the genomes coalescing more recently than that time. 
However, for the CEU and GIH populations, this process was much faster, with only 17,000 
years of detected genetic exchange. We note that our results using phase-resolved genomes are 
broadly consistent with those obtained by a recently described extension of the PSMC model 
applicable to multiple samples, although our use of only two haplotypes limits our ability to 
resolve population sizes more recent times [48]. 
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X to autosome population history differences 

If the effective population size of males and females is equal, scaling the N e inferred from 
X chromosomes by 0.75 and scaling the mutation rate according to the male-to-female mutation 
rate ratio a, will yield a similar population history as that inferred from autosomes. Using 6 YRI 
and 4 CEU individuals sequenced by Complete Genomics [49], we compared populations 
histories inferred using the X chromosome and the autosomes. We calculated values for Q, the 
ratio of effective population size of X chromosome to autosome for each time interval. We 
observe a strong female bias in the ancestral human lineage from 2.4 million years ago to 1 
million years ago with a mean Q value of 0.848, significantly higher than 0.75 (p = 2.2* 10" 16 ). 
We also observed a period of male bias in the YRI lineage for 1,000 generations starting 400 ky 
ago with mean Q value 0.604, significantly lower than 0.75 (p =1.8*10" 9 ). However, the CEU 
lineage does not appear to have male bias during this period, with mean Q value 0.76 (p = 0.78) 
(Figure 4A-B). 

Discussion 

Haplotype information is essential for population genetics analysis, such as ancestry 
mapping, population structure inference [2], and detection of signals of natural selection [50]. 
Phased haplotypes from modern humans have been used to find archaic introgression of 
Neanderthals and to compare the proportion of introgression among different populations [1 1]. 
When adding full haplotype information for individuals from different populations, we can date 
historic events regarding two populations by the distribution of TMRCA of two alleles, one from 
each population, as has been done to date the split time of the Neanderthal and modern human 
lineages [11]. Here, we utilized this method to date the split event between Africans and Non- 
Africans and between CEU and GIH. Our results indicate that the separation of the studied human 
populations was a gradual event, with substantial genetic exchange continuing after an initial split, 
a finding consistent with hypotheses of long-standing ancient population structure in Africa 
(reviewed in [51,52]). Our estimates of initial split times are broadly consistent with those 
obtained using other methods, after adjusting for mutation rate differences (Supplemental Table 
5). Our observations are also comparable with recently published paper extending PSMC to 
multiple individuals, allowing population history inference more recent than 30,000 years ago and 
split- time estimates based on cross-population coalescence rates [48]. However, Schiffels and 
Durbin infer that YRI-non-African population separation took place over a longer time period 
(roughly 100,000 years), and also observe a longer period of exchange between the CEU and GIH 
populations. As new methods are developed, and additional experimentally phased genomes 
become available, we expect that the inferred picture of human population history to become 
increasingly elaborate and accurate. 

Several studies have focused on comparing the genetic diversity on the X chromosome 
and autosomes to learn about differences in the demographic histories of males and females 
[31,32,53-55]. A recent study evaluating 569 females across 14 populations found that X/A 
diversity was in the range 0.76-0.77 for the African continental group, and 0.64-0.65 for five 
populations of European ancestry [32]. Summary statistics based on diversity levels and 
population frequency differences are sensitive to sex bias at different time scales, and it has been 
proposed that female bias along the ancestral human lineage and a male bias before the split 
between European and Asian populations can explain some seemingly contradictory observations 
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for the X/A effective population size ratio [33]. Another study comparing relative recombination 
rates on the X and autosomes also supported female bias in all three HapMap populations [35,56]. 
Our observation also supports a female bias in the ancestral human lineage from 2.4 million years 
ago to 1 million years ago. However, we also observe a period of male bias in YRI lineage for 
1,000 generations from 400ky ago, absent in CEU lineage. 

PSMC analysis gives a picture of sex bias at different time periods. To reconcile these 
observations, we determined the fraction of the X chromosome with coalescence times during the 
distinct epochs of sex-bias. For the YRI, we find that a higher proportion of the X chromosome 
sequences coalesced during the time-span of female bias than during the male bias time-span 
(33% and 14%, p = 2.9* 10" 7 ), which would lead to an overall observation of female-bias 
(Supplemental Figure 8A). For the CEU population, 48% of X chromosome sequences have 
coalesced by the time of initial African-non-African separation (around lOOky ago), and, only 
22% of X chromosome sequence has a coalescence time during the female bias epoch, a 
proportion significantly smaller than that of the YRI population (p = 5.8* 1 0" 5 ) (Supplemental 
Figure 8B). We also note that greater a values exaggerate the female bias in the ancestral human 
lineage, but do not eliminate the male bias observed for YRI population (Supplemental Figure 
9A-B). Our result supports an ancestral female bias in the human lineage [33,35]. The observed 
male bias in the YRI lineage may not have been found by other studies because the proportion of 
extant genomes that coalesce at this time interval is quite small and when considering the overall 
X/A diversity ratio, this signal from a small proportion of the genome can be easily diluted. We 
also believe that the observed differences are not entirely explained by the differential responses 
of X and autosomal diversity to ancestral population size changes [57]. When simulating 
population history and scale N e by % to mimic the effective population size for X chromosome, 
the inferred population history by X chromosome after scaling back 3 /4 recovers the autosomal 
history (Supplemental Figure 10A). The male bias in the YRI lineage could result from any 
population history in which the female effective population size decreased prior to that of males 
(Supplemental Figure 10B). However, as this phenomenon was not observed in a European 
population, during a time period when we infer that African and non-African populations are not 
separated. Other factors, such as sex-biased migration or natural selection acting differently on 
autosomes and X chromosomes in some populations may contribute to this observation. Our 
results highlight the importance of categorizing segments of genomes by coalescence time when 
comparing autosome and X chromosome instead of looking at a global X/A ratio. We expect that 
a more refined view of these historic processes will emerge as additional data from more 
populations, particularly in Africa, become available. 

Materials and Methods 

Whole genome SNP and indel genotyping 

Two genome libraries were constructed based on the standard library preparation and the Illumina 
Nextera framgentase system following the manufacturers protocols and sequenced on Illumina 
Hi-Seq. Paired end reads were aligned to reference genome assembly (GRCh37, with the 
pseudoautosomal regions of the Y chromosome masked to 'N') using BWA v0.5.9-rl6. PCR 
duplicates were removed by Picard vl.62. Reads in regions with known indels were locally 
realigned and base quality scores were recalibrated using GATK [58]. SNPs and indels were 
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called by GATK UnifiedGenotyper v2.3-9. We retained high-quality SNP positions by applying 
Variant Quality Score Recalibration implemented in GATK to select a SNP set that included 99% 
of sites that intersect with the HapMap, 1000 Genomes and dbSNP training set. The resulting 
heterozygous SNPs are the starting point for subsequent haplotype phasing. 

Fosmid clone identification 

Two fosmid libraries were constructed from genomic DNA obtained from Coriell for sample 
NA 19240. Aliquots of 10 ug of DNA were sheared in 120 ul volumes on a Digilab Hydroshear 
for 60 cycles at a speed code of 16. Sheared DNA was loaded and ran on a pulse field gel at 
200V for 26 hours with 0.5s- 15s switching. DNA from 25kb-45kb was cut out of the gel and 
isolated by electroelution for 12 hours at 120 V. After electroelution, DNA was isolated with 
Ampure XP beads, end-repaired with the Epicentre End-It kit and ligated to the Epicentre 
pCClFos fosmid arms. The resulting ligation was packaged and transfected into the Phage T-l 
Resistant EPI300-T1 E. coli plating strain (Catalog Number CCFOS1 10). One hour after 
transfection, the resulting cells were split into the appropriate volumes to give pools of 1,500- 
3,000 cells per pool. Barcoded Nextera libraries for sequencing were constructed from mini- 
prepped DNA from each pool and sequenced on Illumina HiSeq. 

Reads were mapped to reference assembly including human genome (GRCh37/hgl9), Epstein 
Barr virus, the E. coli genome and fosmid vector backbone using BWA v0.5.9-rl6. Candidate 
fosmid clones were identified by computing read-depth in lk bp windows for each clone pool and 
merging consecutive windows allowing a maximum gap of 3 windows. Reads where one end 
mapped to the fosmid vector backbone and another end mapped to human genome, called 
anchoring reads, were used to better assign clone breakpoints. Based on mechanism of fosmid 
library construction, anchoring read ends mapped to the left coordinate of the reference genome 
are always reverse oriented, and anchoring read ends mapped to the right coordinate of reference 
genome are always forward oriented. Observing anchoring reads in the middle of consecutive 
windows identified overlapping clones. Overlapping clones were excluded from downstream 
analysis. 

Haplotype reconstruction 

Each clone pool was separately genotyped at heterozygous SNPs called by whole genome 
shotgun sequencing using GATK UnifiedGenotyper v2.3-9. Clones covering one or more 
heterozygous SNP positions shown in whole genome were used to resolve haplotype in next stage. 
A small proportion of clones (8.1%) were genotyped as heterozygous, probably resulted from 
overlapping clones or mapping errors. These clones were excluded from further analysis. 

We applied ReFHap [27], an efficient algorithm for Single Individual Haplotyping. This 
algorithm only takes clone fragments that contain at least one heterozygous SNP and seeks to 
bipartite clone fragments into two sets that maximize the difference between two sets. The 
algorithm first builds a graph where fragments are linked upon sharing positions and a score is 
assigned on the edge indicating how different two fragments are. RefHap then applies a heuristic 
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algorithm to find the bipartition maximizing the cut, which can also be formulated into an NP- 
hard Max-CUT problem. Finally, the algorithm generates consensus haplotype within one 
partition and flips every allele to get the other haplotype. However, if a site is observed equally in 
each partition, it will remain undecided resulting in gaps in the haplotype. 

NA 19240 was the child of a trio that HapMap had genotyped, thus phasing information is 
determined at genotyped SNPs except for cases where all individuals in the trio are heterozygous. 
We used phase-determined SNPs to guide paternal and maternal allele assignment within blocks. 
In order to differentiate switch errors and HapMap phasing errors, we first divide blocks into 
smaller blocks if there is limited supporting information from clones and then determine paternal 
and maternal allele based on the majority of HapMap phased SNP assignments. For sample 
NA20847, HapMap3 had phase prediction using a population-based statistical method. We 
compared our haplotypes with HapMap phase prediction to assign haplotype within blocks as 
either haplotype 1 or haplotype2. However, we only assign haplotypes when a majority of SNPs 
agree, and tabulated switch errors between our haplotype and HapMap3 prediction. 

Once we have assigned haplotypes within blocks to paternal and maternal allele (haplotype 1 and 
haplotype2 as in NA20847), we extracted original reads from each haplotype-assigned fosmid 
clone and merged them to create haplotype-specific BAMs. We use GATK's UnifiedGenotyper 
v2.3-9 with 'EMIT_ALL_SITES' option to make calls at each position for each haplotype. We 
filtered calls with MQ less than 20 and DP less than 5 or larger than 80 and within 5bp around a 
short indel. Only homozygous calls in each haplotype-specific BAM are kept, and we assign the 
inferred allele from the ReFHap if the call appears heterozygous. Next, we combined calls from 
the merged haplotype BAMs and whole genome sequencing BAM file, and also include BAM 
files from all other clones that were not incorporated due to a lack of heterozygous SNPs to 
capture regions that may not be covered in the whole genome sequencing. 

For sample NA12878, we used the phased SNP haplotypes constructed by fosmid-pool 
sequencing from a previous study [27] , downloaded from http ://www. mol gen. mp g. de/~genetic - 
variation/SIH/data , together with the sequencing results from the 1000 Genomes Project, to 
construct full haplotypes. The distributions of callable regions of each chromosome for the three 
individuals are shown in Supplemental Figure 3. 

Indel phasing 

From haplotype-specific BAMs, we perform indel discovery using GATK. We preformed indel 
calling using calls from the whole genome indel call set as given alleles as well as calling on 
fosmid pools without using indel alleles from the whole genome data (de novo). We performed 
comparison on indel discovery as well as indel phasing. 

PSMC 

PSMC inference was performed as previously described [28]. On autosomal data, we use the 
default setting with T max =15, n=64, and pattern '1*4+25*2+1*4+1*6'. For X-chromosome data, 
we set T max =15, n=60, and pattern '1*6+2*4+1*3+13*2+1*3+2*4+1*6'. We also scale the 
population size inferred by chromosome X by 3 A and set the mutation rate of X chromosome as 
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jjX = [lA—, r- where a is the ratio of male-to-female mutation rate, range from 2 to 5. For X 

3(1 + a) 

chromosome, we also excluded pseudo autosomal region for PSMC analysis because this region 
is diploid in males and has very high recombination rate and sequence diversity. We performed 
bootstrapping to measure the variance of estimates by splitting consensus sequence into lOMbp 
segments and randomly resampling 300 segments with replacement and then rerunning PSMC on 
the resampled segments. Such bootstrapping was repeated 100 times to obtain 95% confidence 
intervals for the estimates. When applying PSMC on pseudo-diploid individuals to infer split time, 
we use the same pattern of the autosome for X chromosome, in order to give finer resolution at 
the times of interested to and enable easy comparison. 

We simulated 100 10M sequences of three individuals (representing African, European and Asian) 
according to Gravel's model[30] using msHot software[59]. For X chromosomes, we scale the 
effective population size by 3 A. We ran PSMC on simulated sequences, for each individual and 
pseudo-diploid individual. We also simulate a scenario in which the gene flow after the initial 
split of African and non African ends at 46ky, 60ky and 80ky ago. 

We downloaded whole genome data for individuals sequenced by Complete Genomics from 
ftp2.completegenomics.com . These data were generated and analyzed with Complete Genomics' 
local de novo assembly-based pipeline [49]. We used the alignments against NCBI build 37 and 
data processed with pipeline version 1.10. We analyzed the CG 'masterVarBeta' files and 
selected single-nucleotide variants (SNVs) and masked 'no-call' regions as well as insertions, 
deletions, substitutions, or partial (half) calls. For each time interval, we calculated the ratio of 
effective population of X chromosome to autosome and test if it significantly deviates from 0.75. 
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Figure Legends 

Figure 1. Haplotype assembly results. Half of the assembled sequence is present in blocks 
longer than 318.5 kbp for NA19240 and 378.7 kbp for NA20847. 

Figure 2. Variants call set comparison with other data sets (A) Heterozygous SNP call set 
comparison (B) Indel call set comparison. Intersection refers to same site and same alleles. 
Fosmid call set refers to indel discovery using fosmid clones pool sequencing. 

Figure 3. PSMC estimates on phased haplotypes. (A) Population history inferred by phased 
autosome and X chromosome sequences from three individuals. NA 19240 represents YRI 
population, NA12878 represents CEU population, NA20847 represents GIH population. (B) 
PSMC result on pseudo-dip loid genome of YRI and CEU. (C) PSMC result on pseudo-dip loid 
genome of YRI and GIH. (D) PSMC result on pseudo-diploid genome of CEU and GIH. 1 stands 
for paternal allele and 2 stands for maternal allele when making pseudo-diploid genome. For the 
X chromosome, estimates are scaled by 3 A and assuming a value of 2. Bootstrapping was 
performed 100 times (green for autosomes, orange and light blue for X chromosomes). 

Figure 4. X to autosome effective population size differences (A) YRI population. (B) CEU 
population. Q refers to ratio of effective population size of X chromosome to autosome. Q larger 
than 0.75 indicates female bias while Q smaller than 0.75 indicates male bias. 

Supplemental Figure 1. Inferred clone statistics (A) Distribution of inferred clone insert sizes 
for two independent librarys (library 1 with median 34kbp, library2 with median 31kbp) and 
distribution of lkb window coverage of inferred clone (median 1.95). (B) Distribution of number 
of clones covered per lkb window (median 5) and distribution of overall coverage per lkb 
window (median 17.9x). 

Supplemental Figure 2. Illustration of ReFHap's phasing result and a switch error. Each 
column corresponds to a SNP position, with blue indicating the reference allele and red the 
alternative. The first two rows are the haplotype prediction by ReFHap, followed by four rows 
showing HapMap phase based on trio transmission. This is followed by 12 rows depicting clone 
genotypes. The last row indicates the parental allele assigned for RefHap haplotype based on 
HapMap phasing. In the last row, blue indicates paternal allele and red indicates maternal allele. 
The line with a star shows where the switch error occurred. 

Supplemental Figure 3. Callable size of each chromosome of three individuals based on 
phasing results. 

Supplemental Figure 4. Venn diagram of indel phasing comparison. Intersection refers to 
indel calls with same site, allele and genotype. Fosmid (wgs) refers to calling indels from fosmid 
pools with whole genome sequencing indel call set as given alleles. Fosmid (de novo) refers to 
calling indels from fosmid pools without given alleles. 

Supplemental Figure 5. PSMC comparisons. (A). PSMC estimate on NA 19240 by three 
different sequencing methods. First two are haplotypes resolved by fosmid-pool sequencing. 
Second are from 1000 Genomes sequencing file. Third are from Complete-Genomics. (B) PSMC 
result on pseudo-diploid genome of YRI and CEU. (C) PSMC result on pseudo-diploid genome 
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of YRI and GIH. (D) PSMC result on pseudo-diploid genome of CEU and GIH. 1 stands for 
paternal allele and 2 stands for maternal allele when making pseudo-diploid genome. NA 19240 
represents YRI population, NA12878 represents CEU population, NA20847 represents GIH 
population. 

Supplemental Figure 6. Simulation study on PSMC's split-time inference. We useed msHOT 
to simulate 3 individuals from African, European, and Asian population assuming Gravel's model. 
(A) simulated autosome sequences, with migration event. (B) simulated X chromosome 
sequences (multiply N value by 0.75), with migration event. (C) simulated autosome sequences, 
without migration event. (D) simulated X chromosome sequences (multiply N value by 0.75), 
without migration event. In the simulation, the European and Asian population experienced the 
same bottleneck event and then went through exponential growth with different rates. The 
African population size remained the same after an ancestral population growth. 

Supplemental Figure 7. Inferred TMRCA distributions on cross-population pseudo-diploids 

(A) Tail TMRCA distribution for simulated data. (B) Tail TMRCA distribution for real data. We 
simulated 100 1 OM sequences of two individuals (representing African and European) according 
to Gravel's model. We set gene flow after the initial split of African and European ending at 46ky, 
60ky and 80ky ago. 

Supplemental Figure 8 Infered TMRCA distributions. (A) TMRCA distribution for 
autosomes and X chromosomes of YRI population (B) TMRCA distribution for autosomes and X 
chromosomes of the CEU population. Region 1 is the time interval of male bias found in YRI 
population and region 2 is the time interval of female bias found in ancestral lineage.. 

Supplemental Figure 9. X to autosome effective population size differences when setting a = 

5. (A) YRI population. (B) CEU population. Q refers to ratio of effective population size of X 
chromosome to autosome. Q larger than 0.75 indicates female bias while Q smaller than 0.75 
indicates male bias. 

Supplemental Figure 10. (A) chrX can recover the same history after scaling back effective 
population size and time by 3 /4 when male and female effective population size is the same (B) 
When male and female effective population size is different, scaling by 3 /4 will give rise to 
different population history 
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Tables 

Table 1 Phasing statistics. 



Sample 


#clones 
after filter 


#blocks 


MEC 
value 


#SNPs to 
be phased 


% phased 
SNPs 


# blocks 
assigned by 
HapMap 
and 1000G 


Switch 
Error 


Switch 

Error 

rate 


NA 19240 


521,783 


17,388 


38,903 


2,405,813 


92.84% 


16,138 


506 


0.09% 


NA20847 


571,419 


16,708 


21,207 


1,681,829 


94.39% 


8,718 


2751 


0.84% 



MEC is the number of entries to correct when resolving haplotypes. Switch error is an 

inconsistency between an assembled haplotype and the real haplotype between two contiguous 

variants, normalized by number of variants for comparison. 

99.7% of 1000G SNPs shown in our data is consistent with our phasing. 

Switch Errors in NA 19240 reflect switch error by ReFHap; while switch errors in NA20847 

reflect relative switch error between HapMap's population based phasing method and ReFHap. 



Table 2 Population Split-time estimation based on pseudo-diploid individuals. 





autosome 


chrX (a = 2) 




Initial Split 


End of Gene flow 


Initial Split 


End of Gene flow 




Time (ky) 


tail 

percentage 


Time (ky) 


tail 

percentage 


Time (ky) 


tail 

percentage 


Time (ky) 


tail 

percentage 


YRI-GIH 


94.1 


2.20% 


51.7 


0.04% 


97.7 


4.60% 


49.3 


0.03% 




(92.7-95.8) 


(1.9%-2.4%) 


(50.9-52.6) 


(0.02%- 
0.06%) 


(93.1-100.6) 


(1.5%-6.3%) 


(46.8-50.7) 


(0.0%- 
0.12%) 


YRI-CEU 


95.6 


1.80% 


52.5 


0.06% 


92.6 


4.70% 


46.6 


0.002% 




(94.0-97.0) 


(1.6%-2.0%) 


(51.7-53.4) 


(0.02%- 
0.09%) 


(89.2-95.8) 


(3.7%-5.8%) 


(44.9-48.3) 


(0.000%- 
0.007%) 


CEU-GIH 


45.6 


0.80% 


27.9 


0.0002% 


45.2 


3.80% 


30.9 


0.02% 




(44.5-46.5) 


(0.5%-1.0%) 


(27.2-28.4) 


(0.000%- 
0.0006%) 


(41.7-48.1) 


(0.9%-7.2%) 


(28.5-33.0) 


(0.0002%- 
0.07%) 



Estimates are in thousand years, a is the ratio of male-to-female mutation rates, which are used 
for time scaling on X chromosomes (See Methods). 
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Supplemental Table 2. Summary of variant calling for whole genome sequencing 





Pre Filter variants 


Post Filter variants 


SNP 






Total 


4,495,127 


3,798,082 


Homozygous 


1,561,566 


1,392,269 


Heterozygous 


2,933,561 


2,405,813 


IndbSNP132 


4,178,217 


3,731,358 


Ti/Tv Ratio 




2.1 


heterozygous sensitivity compared with 
HapMap 


90.00% 


88.40% 


heterozygous sensitivity compared with 
1000G 


89.60% 


87.10% 


genotype concordance with HapMap 


99.40% 


99.50% 


genotype concordance with 1 000G 


99.20% 


99.20% 


Heterozygous SNP genotype 
concordance with HapMap 


99.49% 


99.50% 


Heterozygous SNP genotype 
concordance with 1000G 


99.98% 


99.98% 


f^nriincy variants 






synonymous 




12,980 


non-synonymous 




11,002 


stop-gain 




72 


stop-loss 




12 


Short Indels 






Total 


429,501 




IndbSNP132 


334,097 
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Supplemental Table 3. Comparison of our haplotypes with the sequence of 33 fosmid 
clones from the same individual that were previously sequenced using standard capillary 
sequencing(hapl refers to paternal allele, hap2 refers to maternal allele) 



clone name 


chr 


posl 


pos2 


strand 


# tret SNP 


Het 

mismatch 


All 

mismatch 


Length of 
clone 


eiror rate 


haplotype 
assignment 


AC203596 


20 


60332560 


60367311 


- 


63 


0 


2 


33701 


0.000 


hap2 


AC208180 


7 


109124048 


109164730 


- 


6 


0 


17 


35768 


0.000 


hapl 


AC203618 


14 


24625251 


24665998 


+ 


43 


0 


0 


40532 


0.000 


hapl 


AC203625 


3 


13175578 


13207238 


+ 


52 


0 


1 


31376 


0.000 


hap2 


AC203613 


17 


42388348 


42422435 


+ 


11 


0 


0 


33208 


0.000 


hap2 


AC209301 


5 


103498132 


103533083 


+ 


42 


1 


5 


30147 


0.024 


hap2 


AC2 11777 


2 


131597914 


131644819 


+ 


57 


0 


1 


40161 


0.000 


hap2 


AC207436 


20 


61771673 


61805848 


- 


25 


0 


1 


33436 


0.000 


hap2 


AC203629 


2 


27532275 


27572319 


+ 


27 


0 


0 


39746 


0.000 


hapl 


AC203601 


13 


84295142 


84330569 


+ 


21 


7 


10 


31171 


0.333 


hapl 


AC2 14990 


20 


1835027 


1870122 


+ 


32 


0 


0 


34822 


0.000 


hap2 


AC203623 


13 


50529642 


50571393 


- 


19 


0 


0 


40523 


0.000 


hap2 


AC209312 


19 


11076696 


11117117 


+ 


9 


0 


0 


39917 


0.000 


hapl 


AC204964 


20 


36206659 


36240122 


- 


50 


0 


5 


33213 


0.000 


hap2 


AC203663 


12 


120664025 


120704371 


- 


11 


0 


0 


39941 


0.000 


hapl 


AC203633 


15 


83719658 


83761946 


+ 


30 


1 


1 


40789 


0.033 


hap2 


AC207998 


5 


42518686 


42548692 


+ 


14 


0 


2 


28952 


0.000 


hapl 


AC204962 


10 


73293352 


73327357 


- 


2 


0 


0 


33966 


0.000 


hap2 


AC203585 


12 


127683544 


127718788 


- 


26 


0 


1 


33767 


0.000 


hapl 


AC207584 


22 


24202000 


24237415 


- 


48 


0 


0 


35168 


0.000 


hapl 


AC203595 


17 


15129257 


15154073 


- 


25 


0 


0 


24341 


0.000 


hapl 


AC204968 


12 


108020551 


108054912 




44 


1 


1 


33763 


0.023 


hapl 


AC214217 


20 


34745579 


34780635 




13 


0 


0 


34920 


0.000 


hapl 


AC203614 


13 


27271316 


27306555 




28 


0 


0 


32959 


0.000 


hapl 


AC226164 


17 


8937695 


8978811 


+ 


46 


0 


0 


40616 


0.000 


hap2 


AC203609 


17 


9512088 


9545453 


+ 


22 


0 


0 


27053 


0.000 


hap2 


AC213115 


2 


5679730 


5713339 




21 


1 


1 


32705 


0.048 


hapl 


AC210876 


17 


3897067 


3932032 




52 


0 


1 


34584 


0.000 


hap2 


AC215711 


8 


142371608 


142412790 




74 


0 


2 


40842 


0.000 


hap2 


AC207992 


4 


162086171 


162119199 


+ 


31 


0 


1 


30427 


0.000 


hap2 


AC204957 


7 


135193786 


135227398 


+ 


26 


2 


2 


32900 


0.077 


hap2 


AC209156 


12 


104071670 


104106822 


+ 


15 


0 


0 


34775 


0.000 


hapl 


AC208068 


16 


85067617 


85102315 




122 


0 


2 


34548 


0.000 


hap2 



A total of 13 out of 1 107 heterozygous SNP error occurred when assigning 33 clones into haplotype, an 
error rate of 1.2%. 56 substitution errors out of 1,144,737 bp total sequences yielded a sequence error rate 
of 0.005%. 
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Supplemental Table 4. Phased indel call set comparison. Fosmid (wgs) refers to 
calling indels from fosmid pools with guidance of whole genome sequencing indel call 
set. Fosmid (de novo) refers to calling indels from fosmid pools directly. 





1000 Genomes 


fosmid with wgs 


fosmid denovo 


fosmid combined 


total 


363092 


310296 


241878 


383480 


homozygous 


136415 


156731 


102018 


173116 


paternal indel 


112505 


83904 


68425 


111449 


maternal indel 


106832 


68292 


68399 


95727 


biallelic indel 


7340 


1369 


3036 


3188 



Supplemental Table 5. Split time estimation from previous studies. Reported 
estimates are adjusted by using the same mutation rate 1.2*10" 8 bp/generation. 



Paper 


Method 


African & 
Non-African 
split time 


African & Non- 
African split 
time (adjusted) 


European-Asian 
split time 


European-Asian 
split time 
(adjusted) 


Gravel et al, 
2011 


diffusion, AFS 


51,000 


100,300 


23,000 


45,233 


Gronau et al, 
2011 


Bayesian 
coalescent based 


47,000 


78,333 


36,000 


60,000 


Harris et al, 
2013 


IBS sharing 


55,000 


88,000 
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